40 ◾ Bioinformatics
The sequence length distribution warning can also be raised after clipping adaptors or
overrepresented sequences, because clipping leaves reads of unequal length. We can
therefore clip first, using “fastx_clipper” to remove the overrepresented sequences (see
Figure 1.31), and deal with the resulting length differences afterward. The following
script removes a contaminating overrepresented sequence:
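# Clip the overrepresented sequence given with -a from the filtered, trimmed reads.
# -v prints a summary report; -Q33 declares Phred+33 quality encoding.
fastx_clipper \
  -a ATCGGGAGAGGGGCGGGGAGGGGAAGAGGGGAGAATTCGGGGGGGGCCGG \
  -i bad_filt_trim.fastq \
  -o bad_filt_trim_clip.fastq \
  -v \
  -Q33
# Re-run FastQC on the clipped reads and open the HTML reports in the browser.
fastqc bad_filt_trim_clip.fastq
firefox *.html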
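To check the effect of clipping on read lengths without opening the FastQC report, the
lengths can also be tabulated directly from the FASTQ file. The following is only a
minimal sketch: it assumes that every record occupies exactly four lines (no wrapped
sequence lines), and the function name length_histogram is a hypothetical helper
introduced here for illustration, not part of FASTX-Toolkit or FastQC.

```shell
# Hypothetical helper: print "read_length<TAB>number_of_reads", sorted by length.
# Assumes a four-line-per-record FASTQ file, so the sequence is line 2 of every 4.
length_histogram() {
  awk 'NR % 4 == 2 { count[length($0)]++ }
       END { for (len in count) print len "\t" count[len] }' "$1" | sort -n
}

# Example usage (file name taken from the clipping step above):
# length_histogram bad_filt_trim_clip.fastq
```

If the output contains more than one row, the reads have unequal lengths, which is the
condition that raises the sequence length distribution warning.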
Some aligners used in the next step of the analysis may not accept reads of unequal
length. Figure 1.34 shows the sequence length distribution. If the aligner that we intend
to use has this restriction, we can filter out all reads shorter than 150 bases using the
following bash script:
FIGURE 1.34 Sequence length distribution (different lengths).